Data on Entries in Presidential Diaries:

Origin: Presidential Daily Diaries scraped from the websites of the Presidential libraries. The Truman and FDR diaries were hand-transcribed by the libraries. OCR was run on the rest: by NARA for all administrations except Reagan, where OCR was performed with Tesseract.

Unit of Analysis: Each row represents one diary entry as determined by the parsing algorithm

Columns:
Start: The start time of the meeting.  Always available, except for a small percentage of the Johnson diaries, which use a slightly different parsing method.
End: End time of the meeting. Contains either the explicit end time, if one is available from parsing the schedule, or the implied end time (i.e., start time of the next meeting), where this implies a reasonable end time.  Missing only when no explicit time is provided and none can be inferred (e.g., last meeting of the day)
Duration: End time minus start time, in minutes; where AM/PM is not available, it is inferred from context.  Meeting durations greater than 300 minutes are not allowed and have been recoded to MISSING, which is potentially consequential for analysis.  Total missingness on this variable is roughly 10%.
Activity: Text extracted by the parsing algorithm from the schedule that describes what the President did or the person with whom he met.  What is recorded in this field varies somewhat across administrations.
Date: Date, explicitly supplied, except in the Bush and Nixon administrations, where it must be read by OCR.
IsFopo: Does the "Activity" field contain a match to a designated foreign policy word? See below for details.
Match1,Match2,Match3: Foreign policy regex (if any) that the "Activity" field matched.  If more than 3, only the first 3 are listed (these are not in a meaningful order)
Product: For ease of aggregating the data, constructed by multiplying the "Duration" and "IsFopo" columns
indic: Count of the words in the "Activity" field that are found among the 333,333 most common words of the Google Web Trillion Word Corpus, available from http://norvig.com/ngrams/count_1w.txt
words: Count of the number of words in the "Activity" field.  Consequently, indic/words is a very rough measure of OCR quality.  Note, however, that the dictionary used does NOT contain proper nouns that are almost certainly correct (e.g., Stettinius) and does contain common misspellings and typos that are almost certainly NOT correct (e.g., teh).  These errors are likely to be largely orthogonal to OCR error.
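
The derived columns above (Duration, Product, and the indic/words ratio) can be sketched as follows. This is a minimal illustration in Python, not the code used to build the data: the 24-hour "HH:MM" time format, the function names, and the sample rows are all assumptions made for the example.

```python
from datetime import datetime

MAX_MINUTES = 300  # cap on Duration stated in this codebook


def duration_minutes(start, end):
    """End minus Start in minutes; None (MISSING) if either time is absent or the cap is exceeded."""
    if start is None or end is None:
        return None
    fmt = "%H:%M"  # assumed time format, for illustration only
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    minutes = delta.total_seconds() / 60
    return None if minutes > MAX_MINUTES else minutes


def product(duration, is_fopo):
    """Duration multiplied by IsFopo (0/1); MISSING propagates."""
    return None if duration is None else duration * int(is_fopo)


def ocr_quality(indic, words):
    """Share of Activity words found in the common-word list; None if the field is empty."""
    return indic / words if words else None


# Hypothetical rows, not taken from the data.
rows = [
    {"Start": "09:00", "End": "09:45", "IsFopo": True, "indic": 4, "words": 5},
    {"Start": "10:00", "End": "16:30", "IsFopo": False, "indic": 3, "words": 3},
    {"Start": "17:00", "End": None, "IsFopo": True, "indic": 0, "words": 0},
]
for row in rows:
    row["Duration"] = duration_minutes(row["Start"], row["End"])
    row["Product"] = product(row["Duration"], row["IsFopo"])
    row["Quality"] = ocr_quality(row["indic"], row["words"])

# Aggregating Product gives total foreign policy minutes, skipping MISSING rows.
fopo_minutes = sum(r["Product"] for r in rows if r["Product"] is not None)
```

Because MISSING durations are skipped when Product is summed, any aggregate of Product understates foreign policy time to the extent that long or unterminated meetings were foreign policy meetings.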

The dictionary used to detect foreign policy content contains the following general categories:
1) Generic words related to foreign affairs (e.g., international or diplomat), names of positions related to foreign affairs (e.g., "Secretary of State"), and names of agencies involved in foreign affairs (e.g., "Central Intelligence").
2) Names of the countries of the world, except where these are likely to refer to other things (e.g., Jordan)
3) Names of foreign policy officeholders, during their time in office, generally simply matched by last name, except where this is likely to result in bad matches (e.g., because "Rogers" is a common name, Nixon's Secretary of State matches only to "William Rogers" and because William Rogers served as Attorney General in the Eisenhower administration, his name is a foreign policy match only in the years 1969 to 1973.)  The following offices are included: 
a) Secretaries of State
b) Secretaries of Defense (Secretaries of War and Navy pre-1947)
c) National Security Advisors (from 1953)
d) Deputy National Security Advisors (from 1961)
e) Directors of Central Intelligence (from 1947)
f) Chairmen of the Joint Chiefs of Staff (from 1949)
g) Other selected senior officials: Top military commanders during WW2, Korea, and Vietnam.  Deputy secretaries of State and Defense of particular prominence, particularly during the earlier administrations
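
The matching logic described above, including year-limited officeholder names, can be sketched as follows. This is a minimal Python illustration under stated assumptions: the five-entry dictionary, the exact regular expressions, and the function names are hypothetical, and only the 1969 to 1973 window for "William Rogers" comes from this document.

```python
import re

# Hypothetical miniature dictionary: (label, pattern, first_year, last_year).
# A year bound of None means the entry applies in all years.
DICTIONARY = [
    ("Secretary of State", re.compile(r"\bSecretary of State\b"), None, None),
    ("Ambassador", re.compile(r"\bAmbassador\b"), None, None),
    ("Gen.", re.compile(r"\bGen\."), None, None),  # the abbreviation only, never "General"
    ("Kissinger", re.compile(r"\bKissinger\b"), None, None),
    ("William Rogers", re.compile(r"\bWilliam Rogers\b"), 1969, 1973),
]


def match_entry(activity, year, limit=3):
    """Return up to `limit` dictionary labels whose pattern matches the Activity text."""
    hits = []
    for label, pattern, first, last in DICTIONARY:
        in_window = (first is None or year >= first) and (last is None or year <= last)
        if in_window and pattern.search(activity):
            hits.append(label)
    return hits[:limit]


def is_fopo(activity, year):
    """True if the Activity text matches at least one dictionary entry."""
    return bool(match_entry(activity, year))
```

With this sketch, is_fopo("Attorney General William Rogers", 1958) is False, because the name falls outside its window and "General" is not a dictionary entry, while the same name in 1970 is a match.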

Out of all matches, about 37% match at least two separate dictionary entries.  The five most common "solo" codes (i.e., only a single regex is matched) are (frequencies in parentheses):
1) Kissinger (5,275)
2) Secretary of State (4,841)
3) Gen. (4,693) - note we allow the abbreviation “Gen.” to match because this nearly always matches senior military officers.  We do not allow the word “General” to match foreign policy because this frequently refers to other things (e.g., the Attorney General)
4) Ambassador (3,663)
5) National Security (3,021)

These are also the five most common dictionary matches overall.

A total of 57 dictionary entries account for at least 100 “solo” matches, and a total of 285 dictionary entries account for at least 1 “solo” match.  

A total of 111 dictionary entries match the diaries at least 100 times, and a total of 310 entries match the diaries at least once.  Replicators can produce a list of all such regular expressions, sorted by frequency, with the following R call after loading the attention data:
    sort(table(c(combined[,"Match1"],combined[,"Match2"],combined[,"Match3"])))
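
For readers not working in R, the overall and "solo" tabulations can be sketched in Python as follows. This is an illustrative sketch only: the function name, the treatment of empty strings and "NA" as missing, and the sample rows are assumptions, and the counts it produces are not the frequencies reported above.

```python
from collections import Counter

MATCH_COLUMNS = ("Match1", "Match2", "Match3")
MISSING = (None, "", "NA")  # assumed encodings of a missing match


def tabulate_matches(rows):
    """Count every dictionary match, and separately the matches that occur alone in a row."""
    overall = Counter()
    solo = Counter()
    for row in rows:
        matches = [row.get(col) for col in MATCH_COLUMNS]
        matches = [m for m in matches if m not in MISSING]
        overall.update(matches)
        if len(matches) == 1:  # "solo": only a single regex matched
            solo[matches[0]] += 1
    return overall, solo


# Hypothetical rows, not taken from the data.
sample = [
    {"Match1": "Kissinger", "Match2": "NA", "Match3": "NA"},
    {"Match1": "Kissinger", "Match2": "Ambassador", "Match3": "NA"},
    {"Match1": "Ambassador", "Match2": "", "Match3": ""},
    {"Match1": "NA", "Match2": "NA", "Match3": "NA"},
]
overall, solo = tabulate_matches(sample)
```

Calling overall.most_common() lists entries from most to least frequent, which is the reverse of the ascending order that sort(table(...)) returns in R.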





